In [1]:
using DataFrames
using Distributions
using Gadfly
using Optim

Example 2.2.1 (pp. 19–23)

The numbers of chronic medical conditions reported by samples of women living in large country towns or in more rural areas in New South Wales, Australia.


In [2]:
T = [0, 1, 1, 0, 2, 3, 0, 1, 1, 1, 1, 2, 0, 1, 3, 0, 1, 2, 1, 3, 3, 4, 1, 3, 2, 0];
C = [2, 0, 3, 0, 0, 1, 1, 1, 1, 0, 0, 2, 2, 0, 1, 2, 0, 0, 1, 1, 1, 0, 2];

Under the null hypothesis, $H_0$, $\theta_{\text{T}} = \theta_{\text{C}} = \theta$, while under the alternative hypothesis, $H_1$, $\theta_{\text{T}} \neq \theta_{\text{C}}$.

If $H_0$ is true, the log-likelihood function for the counts is

$$\ell_0 = \ell(\theta; \mathbf{y}) = \sum_i (\text{T}_i \log \theta - \theta - \log \text{T}_i !) + \sum_i (\text{C}_i \log \theta - \theta - \log \text{C}_i !).$$

The maximum likelihood estimate is

$$\hat{\theta}_0 = \frac{\sum_i \text{T}_i + \sum_i \text{C}_i}{\lvert \text{T}\rvert + \lvert \text{C}\rvert}.$$

In [3]:
θ̂₀ = (sum(T) + sum(C)) / (size(T, 1) + size(C, 1))


Out[3]:
1.183673469387755

In [4]:
(θ, Y) = mapreduce((y) -> y * log(θ) - θ - log(factorial(y)), +, 0, Y);

In [5]:
ℓ₀(θ) = (θ, T) + (θ, C);

In [6]:
ℓ̂₀ = ℓ₀(θ̂₀)


Out[6]:
-68.3868179494798

If $H_1$ is true, the log-likelihood for the counts is

$$\ell_1 = \ell(\theta_\text{T}, \theta_\text{C}; \mathbf{y}) = \sum_i (\text{T}_i \log \theta_\text{T} - \theta_\text{T} - \log \text{T}_i !) + \sum_i (\text{C}_i \log \theta_\text{C} - \theta_\text{C} - \log \text{C}_i !) .$$

The maximum likelihood estimates are

$$\hat{\theta}_{1\text{T}} = \frac{\sum_i \text{T}_i}{\lvert \text{T}\rvert}$$

and

$$\hat{\theta}_{1\text{C}} = \frac{\sum_i \text{C}_i}{\lvert \text{C}\rvert}.$$

In [7]:
θ̂₁t = sum(T) / size(T, 1)


Out[7]:
1.4230769230769231

In [8]:
θ̂₁c = sum(C) / size(C, 1)


Out[8]:
0.9130434782608695

In [9]:
ℓ₁(θt, θc) = (θt, T) + (θc, C);

In [10]:
ℓ̂₁ = ℓ₁(θ̂₁t, θ̂₁c)


Out[10]:
-67.02295175203457

The maximum value of the log-likelihood function $\ell_1$ will always be greater than or equal to that of $\ell_0$ as one more parameter has been fitted. To decide whether the difference is statistically significant, we need to know the sampling distribution of the log-likelihood function.

[Discussed later.]


In [11]:
xs = linspace(0, 2.5)[2:end]

ys = Array{Float64}(size(xs))

for i in eachindex(xs)
    ys[i] = ℓ₀(xs[i])
end

plot(x = xs, y = ys, Geom.line)


Out[11]:
x -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 -2.5 -2.4 -2.3 -2.2 -2.1 -2.0 -1.9 -1.8 -1.7 -1.6 -1.5 -1.4 -1.3 -1.2 -1.1 -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 -2.5 0.0 2.5 5.0 -2.6 -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 -400 -390 -380 -370 -360 -350 -340 -330 -320 -310 -300 -290 -280 -270 -260 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 -400 -200 0 200 -400 -380 -360 -340 -320 -300 -280 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 y

In [12]:
θ̃₀ = optimize((θ) -> -ℓ₀(θ), 0, 2).minimum


Out[12]:
1.1836734533008915

In [13]:
θ̃₀ = θ̂₀


Out[13]:
1.183673469387755

In [14]:
d₀ = Poisson(θ̂₀)
d₁t = Poisson(θ̂₁t)
d₁c = Poisson(θ̂₁c);

In [15]:
x = 10
xs = 0:x

Ht = hist(T, -1:x)[2]
Hc = hist(C, -1:x)[2]
Htc = DataFrame(
    count = xs,
    T = Ht, C = Hc,
    T̂₀ = pdf(d₀, xs) .* length(T), C_hat_0 = pdf(d₀, xs) .* length(T),
    T̂₁ = pdf(d₁t, xs) .* length(T), Ĉ₁ = pdf(d₁c, xs) .* length(T)
)


WARNING: tty_size is deprecated. use `displaysize(io)` as a replacement
 in depwarn at deprecated.jl:73
 in tty_size at deprecated.jl:924
 in show at /home/scottc/.julia/v0.5/DataFrames/src/abstractdataframe/show.jl:440
 [inlined code] from expr.jl:8
 in showcompact at show.jl:1425
 in writemime at replutil.jl:4
 [inlined code] from expr.jl:8
 in writemime at multimedia.jl:43
 in sprint at strings/io.jl:38
 in display_dict at /home/scottc/.julia/v0.4/IJulia/src/execute_request.jl:26
 in execute_request_0x535c5df2 at /home/scottc/.julia/v0.4/IJulia/src/execute_request.jl:212
 [inlined code] from dict.jl:733
 in eventloop at /home/scottc/.julia/v0.4/IJulia/src/IJulia.jl:141
 in anonymous at task.jl:435
while loading /home/scottc/.julia/v0.4/IJulia/src/kernel.jl, in expression starting on line 31
WARNING: tty_size is deprecated. use `displaysize(io)` as a replacement
Out[15]:
countTCT̂₀C_hat_0T̂₁Ĉ₁
10697.9599527876083987.9599527876083986.26525703689988210.4338263291891
211089.4219849322711659.4219849322711658.9159427063575259.526537083172656
32455.5762767966502825.5762767966502826.3440361564467014.349071277100561
43512.2001636340524922.2001636340524923.00935048446830681.3236303886827792
54100.65106883048492110.65106883048492111.07063430697430140.3021330235036778
65000.154130580278062950.154130580278062950.30471899506191660.055172117335454206
76000.0304067131160804460.0304067131160804460.072273094982634070.008395756985829988
87000.0051416599438270150.0051416599438270150.0146928819470190150.0010950987372821722
98000.00076075580801522150.00076075580801522150.00261363765403703660.0001249840950159001
109000.000100054051847806890.000100054051847806890.00041326749230500151.2679545871178271e-5
1110001.184313266769959e-51.184313266769959e-55.881114313571176e-51.1576976664988854e-6

In [16]:
Htcₗ = stack(Htc, collect(2:7))


Out[16]:
variablevaluecount
1T6.00
2T10.01
3T4.02
4T5.03
5T1.04
6T0.05
7T0.06
8T0.07
9T0.08
10T0.09
11T0.010
12C9.00
13C8.01
14C5.02
15C1.03
16C0.04
17C0.05
18C0.06
19C0.07
20C0.08
21C0.09
22C0.010
23T̂₀7.9599527876083980
24T̂₀9.4219849322711651
25T̂₀5.5762767966502822
26T̂₀2.2001636340524923
27T̂₀0.65106883048492114
28T̂₀0.154130580278062955
29T̂₀0.0304067131160804466
30T̂₀0.0051416599438270157
&vellip&vellip&vellip&vellip
 in depwarn at deprecated.jl:73
 in tty_size at deprecated.jl:924
 in getchunkbounds at /home/scottc/.julia/v0.5/DataFrames/src/abstractdataframe/show.jl:199
 in showrows at /home/scottc/.julia/v0.5/DataFrames/src/abstractdataframe/show.jl:338
 in show at /home/scottc/.julia/v0.5/DataFrames/src/abstractdataframe/show.jl:456
 [inlined code] from expr.jl:8
 in showcompact at show.jl:1425
 in writemime at replutil.jl:4
 [inlined code] from expr.jl:8
 in writemime at multimedia.jl:43
 in sprint at strings/io.jl:38
 in display_dict at /home/scottc/.julia/v0.4/IJulia/src/execute_request.jl:26
 in execute_request_0x535c5df2 at /home/scottc/.julia/v0.4/IJulia/src/execute_request.jl:212
 [inlined code] from dict.jl:733
 in eventloop at /home/scottc/.julia/v0.4/IJulia/src/IJulia.jl:141
 in anonymous at task.jl:435
while loading /home/scottc/.julia/v0.4/IJulia/src/kernel.jl, in expression starting on line 31
WARNING: tty_size is deprecated. use `displaysize(io)` as a replacement
 in depwarn at deprecated.jl:73
 in tty_size at deprecated.jl:924
 in writemime at /home/scottc/.julia/v0.5/DataFrames/src/abstractdataframe/io.jl:144
 in sprint at strings/io.jl:38
 in display_dict at /home/scottc/.julia/v0.4/IJulia/src/execute_request.jl:39
 in execute_request_0x535c5df2 at /home/scottc/.julia/v0.4/IJulia/src/execute_request.jl:212
 [inlined code] from dict.jl:733
 in eventloop at /home/scottc/.julia/v0.4/IJulia/src/IJulia.jl:141
 in anonymous at task.jl:435
while loading /home/scottc/.julia/v0.4/IJulia/src/kernel.jl, in expression starting on line 31

In [17]:
plot(
    Htcₗ[(col -> (col .== :C) | (col .== :C_hat_0))(Htcₗ[:variable]), :],
    x = :count, y = :value, color = :variable,
    Geom.bar(position = :dodge)
)


Out[17]:
count -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 -25 -24 -23 -22 -21 -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 -40 -20 0 20 40 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 C C_hat_0 variable -15 -10 -5 0 5 10 15 20 25 -10.0 -9.5 -9.0 -8.5 -8.0 -7.5 -7.0 -6.5 -6.0 -5.5 -5.0 -4.5 -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0 -10 0 10 20 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 value

In [18]:
Htcₗ[(col -> (col .== :C) | (col .== :C_hat_0))(Htcₗ[:variable]), :]


Out[18]:
variablevaluecount
1C9.00
2C8.01
3C5.02
4C1.03
5C0.04
6C0.05
7C0.06
8C0.07
9C0.08
10C0.09
11C0.010
12C_hat_07.9599527876083980
13C_hat_09.4219849322711651
14C_hat_05.5762767966502822
15C_hat_02.2001636340524923
16C_hat_00.65106883048492114
17C_hat_00.154130580278062955
18C_hat_00.0304067131160804466
19C_hat_00.0051416599438270157
20C_hat_00.00076075580801522158
21C_hat_00.000100054051847806899
22C_hat_01.184313266769959e-510